A Partitioned Similarity Search with Cache-Conscious Data Traversal
Abstract
All pairs similarity search (APSS) is used in many web search and data mining applications. Previous work has used techniques such as comparison filtering, inverted indexing, and parallel accumulation of partial results. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-phase approach called Partition-based Similarity Search (PSS). The first phase partitions the data and groups vectors that are potentially similar. The second phase runs a set of tasks, each of which compares one partition of vectors with other candidate partitions. Because of data sparsity and the memory hierarchy, accessing feature vectors during the partition comparison phase incurs significant overhead. This paper introduces a cache-conscious design for data layout and traversal that reduces access time through size-controlled data splitting and vector coalescing, and it provides an analysis to guide the choice of optimization parameters. The evaluation results show that, for the tested datasets, the proposed approach eliminates unnecessary I/O and data communication early while sustaining parallel efficiency, yielding an order-of-magnitude performance improvement. It can also be integrated with LSH for approximate APSS.
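The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-size chunking in `make_partitions` is only a placeholder for the paper's grouping of potentially similar vectors, cosine similarity over sparse dictionaries is an assumed similarity measure, and the cache-conscious splitting and coalescing are not modeled. All function names are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (feature -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def make_partitions(vectors, size):
    """Phase 1 (placeholder): chunk sorted vector ids into fixed-size groups.
    The paper groups potentially similar vectors; that criterion is not modeled."""
    ids = sorted(vectors)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def compare_task(vectors, part, candidates, threshold):
    """Phase 2: compare one partition against its candidate partitions.
    An inverted index over the partition accumulates partial dot products."""
    index = defaultdict(list)
    for vid in part:
        for feature, weight in vectors[vid].items():
            index[feature].append((vid, weight))

    found = {}
    for other in candidates:
        for oid in other:
            scores = defaultdict(float)
            for feature, weight in vectors[oid].items():
                for vid, vweight in index.get(feature, ()):
                    scores[vid] += weight * vweight
            for vid, score in scores.items():
                if vid != oid and score >= threshold:
                    found[(min(vid, oid), max(vid, oid))] = score
    return found


def pss(raw_vectors, threshold, size=2):
    """Run one task per partition; each task is independent of the others,
    so no intermediate results need to be shuffled between tasks."""
    vectors = {k: normalize(v) for k, v in raw_vectors.items()}
    parts = make_partitions(vectors, size)
    pairs = {}
    for i, part in enumerate(parts):
        # Candidates: this partition and all later ones (no filtering here).
        pairs.update(compare_task(vectors, part, parts[i:], threshold))
    return pairs


docs = {
    "d1": {"a": 1.0, "b": 1.0},
    "d2": {"a": 1.0, "b": 1.0},
    "d3": {"c": 1.0},
    "d4": {"a": 1.0, "c": 1.0},
}
similar = pss(docs, threshold=0.7)
```

In this toy run, `similar` holds the pairs whose cosine similarity is at least 0.7. Because each task writes only its own results, the loop in `pss` could be distributed across workers without exchanging partial scores, which is the property the abstract contrasts with parallel accumulation.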
Similar resources
CC-GiST: Cache Conscious-Generalized Search Tree for Supporting Various Fast Intelligent Applications
As technology advances, the speed gap between the CPU and main memory grows larger every year. Because of this gap, making the most of the cache that sits between the CPU and main memory has come to be seen as important, and there has been a great deal of research on the issue. Among that work is research on cache-conscious trees for reducing the cost of accessing main memory ...
The Dense Skip Tree: A Cache-Conscious Randomized Data Structure
We introduce the dense skip tree, a novel cache-conscious randomized data structure. Algorithms for search, insertion, and deletion are presented, and they are shown to have expected cost O(log n). The dense skip tree obeys the same asymptotic properties as the skip list and the skip tree. A series of properties on the dense skip tree is proven, in order to show the probabilistic organization of...
Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing
Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like k-d trees do not perform well), is Locality-Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper,...
Effect of Node Size on the Performance of Cache-Conscious Indices
In main-memory environments, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve the performance of main-memory indices by reducing the number of processor cache misses that are incurred during a search operation. Conventional wisdom suggests that the index’s node size should be equal to the cache line size ...
Index Search Algorithms for Databases and Modern CPUs
Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+-trees, k-ary binary search, and fast architecture sensitive tree search. There have also been papers on how best to set the many different parameters of these index structures, such as the node size of CSB+-trees. These indices have been proposed because CPU speeds have been in...